-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsexec: cloned binary rework #3987
Conversation
} | ||
path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd())) | ||
if err := unix.Access(path, unix.X_OK); err == nil { | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp
program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).
I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock
lock contention again....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the new RUNC_DMZ=legacy
option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.
we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before.
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected]
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lifubang [email protected] Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 5 or 6 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 5 or 6 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 5 or 6 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. But because of we have added safeExe to the set of ExtraFiles, if the fd of safeExe is not bigger than stdio fds count + ExtraFiles count, go stdlib will dup3 it to another fd, then it will cause the original fd closed. (opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. Because we want to add safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a other fd to this fd, then it will cause the fd type cmd.Path refers to a random path. (issue: opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
In opencontainers#3987(0e9a335), we may use a memfd to copy run to start runc init, due to a Go stdlib bug, we need to add safeExe to the set of ExtraFiles otherwise it is possible for the stdlib to clobber the fd during forkAndExecInChild1 and replace it with some other file that might be malicious. This is less than ideal (because the descriptor will be non-O_CLOEXEC) however we have protections in "runc init" to stop us from leaking extra file descriptors. See <golang/go#61751>. There is a race situation when we are opening this memfd, if the fd 6 or 7 was closed at that time, maybe it will be reused by memfd. Because we want to add safeExe to the set of ExtraFiles, if the fd of safeExe is too small, go stdlib will dup3 it to another fd, or dup3 a other fd to this fd, then it will cause the fd type cmd.Path refers to a random path. (issue: opencontainers#4294) Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
This reverts commit 65a1074. we need (opencontainers#4020) because of (opencontainers#3931), at that time, we removed the bindfd logic, and the memfd logic will use more memory than before, but we have not yet moved binary clone from runc init to runc parent process, so we need to increase memory limit in CI. As we have moved the runc binary clone logic from runc init to runc parent process in (opencontainers#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. Now we can support run a simple container with lower memory usage the same as before. Signed-off-by: lfbzhm <[email protected]>
Split out from #3953 to be reviewed separately.
First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Then, carry the code from #3983 and combine it with the above to
keep all of the
runc-dmz
logic in the Go code as well. Therunc-dmz
feature is optional (controlled by the
runc_nodmz
build tag atbuild time, or
RUNC_DMZ=legacy
at runtime), with a fallback to theclassic
/proc/self/exe
cloning. If/proc/self/exe
is already acloned binary then neither
runc-dmz
nor/proc/self/exe
arecloned.
This also adds a
memfd-bind
helper binary for users which reallyneed to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).
Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai [email protected]